Abstract. - Complex networks emerge under different conditions through simple rules of
growth and evolution. Such rules are typically local when dealing with biological systems and
most social webs. An important deviation from this scenario is provided by communities
of agents engaged in technology development, such as open source (OS) communities. Here
we analyze their network structure, showing that it defines a complex weighted network with
scaling laws at different levels, as measured by looking at e-mail exchanges. We also present
a simple model of network growth involving non-local rules based on betweenness centrality.
The model outcomes fit the observed scaling laws very well, suggesting that the overall goals
of the community and the underlying hierarchical organization play a key role in shaping its
dynamics.



Introduction. - Networks predate complexity, from biology to society and technology [1].
In many cases, large-scale, system-level properties emerge from local interactions among
network components. This is consistent with the general lack of global goals that pervades cellular
webs or acquaintance networks. However, when dealing with large-scale technological designs,
the situation can be rather different. This is particularly true for some communities of
designers working together in a decentralized manner. Open source communities, in particular,
provide the most interesting example, where software is developed through distributed
cooperation among many agents. The software systems are themselves complex networks [2,3,4],
which have been shown to display small-world and scale-free architecture. In this paper we
analyze the global organization of these problem-solving communities and the possible rules
of self-organization that drive their evolution as weighted networks.

Following [5], we have analyzed the structure and modeled the evolution of social
interaction in OS communities [6]. Here e-mail is an important vehicle of communication and
we can recover social interactions by analyzing the full register of e-mails exchanged between
community members. From this dataset, we have focused on the subset of e-mails describing
new software errors (bug tracking) and on the subsequent e-mail discussion on how to solve
the error (bug fixing). Nodes $v_i \in V$ in the social network $G = (V, L)$ represent community
Fig. 1 - Social networks of e-mail exchanges in open source communities display hierarchical features.
Line thickness represents the number of e-mails flowing from the sender to the receiver. Darker nodes
and links denote active members and frequent communication, respectively. (A) The social network for
the Amavis community has $N = 98$ members, where the three center nodes display the largest traffic
loads. (B) The social network for the TCL community has $N = 215$ members and average degree $\langle k \rangle \approx 3$.
There is a small subgraph of core members (i.e., the hierarchical backbone) concentrating the bulk
of e-mail traffic. Note how strong edges connect nodes with heavy traffic load.

members, while directed links $(i, j) \in L$ denote e-mail communication in which member $i$
replies to member $j$. At time $t$, a member $v_j$ discovers a new software error (bug) and
sends a notification e-mail. Then, other members investigate the origin of the software bug
and eventually reply to the message, either explaining the solution or asking for more
information. Here $E_{ij}(t) = 1$ if developer $i$ replies to developer $j$ at time $t$ and is zero otherwise.
Link weight $e_{ij}$ is the total amount of e-mail traffic flowing from developer $i$ to developer $j$:

$$e_{ij} = \sum_{t=0}^{T} E_{ij}(t) \qquad (1)$$

where $T$ is the timespan of software development. We have found that e-mail traffic is highly
symmetric, i.e., $e_{ij} \approx e_{ji}$. In order to measure link symmetry, we introduce a weighted
measure of link reciprocity [7], namely the link weight reciprocity $\rho^w$, defined as

$$\rho^w = \frac{\sum_{i \neq j} (e_{ij} - \bar{e})(e_{ji} - \bar{e})}{\sum_{i \neq j} (e_{ij} - \bar{e})^2} \qquad (2)$$

where $\bar{e} = \sum_{i \neq j} e_{ij} / N(N - 1)$ is the average link weight. This coefficient enables us to
differentiate between weighted reciprocal networks ($\rho^w > 0$) and weighted antireciprocal networks
($\rho^w < 0$). The neutral case is given by $\rho^w \approx 0$. All systems analyzed here display strong
symmetry, with $\rho^w \approx 1$. This pattern can be explained in terms of fair reciprocity [8], where
any member replies to every received e-mail.
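
As an illustration of the link weight reciprocity, the following minimal Python sketch evaluates $\rho^w$ for a small traffic matrix. The function name `link_weight_reciprocity` and the two toy matrices are hypothetical and are not taken from the datasets analyzed here; the sketch also assumes that link weights are not all equal, since the denominator would otherwise vanish.

```python
def link_weight_reciprocity(e):
    """Weighted reciprocity of a directed traffic matrix.

    e[i][j] is the number of e-mails sent from member i to member j.
    Assumes the off-diagonal weights are not all identical.
    """
    n = len(e)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    # Average link weight over all ordered pairs i != j.
    mean = sum(e[i][j] for i, j in pairs) / (n * (n - 1))
    num = sum((e[i][j] - mean) * (e[j][i] - mean) for i, j in pairs)
    den = sum((e[i][j] - mean) ** 2 for i, j in pairs)
    return num / den


# Toy example 1: every e-mail is answered (fully symmetric traffic).
symmetric = [[0, 3, 1],
             [3, 0, 0],
             [1, 0, 0]]

# Toy example 2: traffic flows around a cycle and is never returned.
one_way = [[0, 3, 0],
           [0, 0, 3],
           [3, 0, 0]]

print(link_weight_reciprocity(symmetric))  # reciprocal case
print(link_weight_reciprocity(one_way))    # antireciprocal case
```

Fully symmetric traffic gives the reciprocal extreme and purely one-way traffic the antireciprocal one, which matches the sign convention used in the text.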

In the following, we will focus on the analysis of the undirected (and weighted) graph. Let
us define edge weight (interaction strength) as $w_{ij} = e_{ij} + e_{ji}$, which provides a measure of
traffic exchanges between any pair of members. Two measures of node centrality are frequently
used to evaluate node importance. A global centrality measure is betweenness centrality $b_i$ [9]
(i.e., node load [10]), measured as the number of shortest paths passing through node $i$.
Node strength [11] is a local measure defined as

$$s_i = \sum_{j} w_{ij} \qquad (3)$$
Fig. 2 - (A) Average betweenness centrality vs degree, $\langle b(k) \rangle \sim k^{\eta}$, where $\eta \approx 1.59$ for the Python
OS community. This exponent is close to the theoretical prediction $\eta_{BA} \sim (\gamma - 1)/(\delta - 1) = 1.70$
(see text). (B) Cumulative distribution of undirected degree, $P_>(k) \sim k^{-\gamma+1}$, with $\gamma \approx 1.97$. (C)
Cumulative distribution of betweenness centrality, $P_>(b) \sim b^{-\delta+1}$, with $\delta \approx 1.57$.



i.e., the total number of messages exchanged between node $i$ and the rest of the community.
The correlation of centrality measures with local measures (such as undirected degree $k_i$) can
be used to assess the impact of global forces on network dynamics.

Figure 1 shows two social networks recovered with the above method. We can appreciate
a heterogeneous pattern of e-mail interaction, where a few nodes generate a large fraction of
e-mail replies. The undirected degree distribution is a power law $P(k) \sim k^{-\gamma}$ with $\gamma \approx 2$ (see
fig. 2B). These social networks exhibit a clear hump for large degrees. Betweenness centrality
displays a long tail $P(b) \sim b^{-\delta}$ with an exponent $\delta$ between 1.3 and 1.8 (see table I and
also fig. 2C). It was shown that betweenness centrality scales with degree in the network of
Internet autonomous systems and in the Barabási-Albert network [12], as $b(k) \sim k^{\eta}$. From
the cumulative degree distribution, i.e.,

$$P_>(k) = \int_k^{\infty} P(k)\,dk \sim k^{1-\gamma} \qquad (4)$$

and the corresponding integrated betweenness, with $P_>(b) \sim b^{1-\delta}$, it follows that $\eta =
(\gamma - 1)/(\delta - 1)$ [13]. The social networks studied here display a similar scaling law with
an exponent $\eta$ slightly departing from the theoretical prediction (see fig. 2A and table I).
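
The relation between the three exponents can be made explicit in one step. The sketch below assumes, for illustration only, that betweenness grows monotonically with degree, so that the nodes with degree above $k$ are exactly the nodes with betweenness above $b(k)$:

```latex
P_>(k) = P_>\bigl(b(k)\bigr)
\;\Longrightarrow\;
k^{1-\gamma} \sim b(k)^{1-\delta} \sim k^{\eta(1-\delta)}
\;\Longrightarrow\;
\eta = \frac{\gamma - 1}{\delta - 1}.
```

Deviations of the measured $\eta$ from this value are then a sign that the monotonic relation between degree and betweenness holds only on average.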



Project     N      L      $\langle k \rangle$   $\gamma$   $\delta$   $\eta$   $(\gamma-1)/(\delta-1)$
Python      1090   3207   2.94                  1.97       1.57       1.59     1.70
Gaim        1415   2692   1.9                   1.97       1.8        1.24     1.21
Slashcode   643    1093   1.69                  1.88       1.58       1.42     1.51
PCGEN       579    1654   2.85                  2.04       1.67       1.54     1.55
TCL         215    590    2.74                  1.97       1.33       2.34     2.93

Table I - Topological measures performed over large OS weighted nets. The last two columns
compare the observed $\eta$ exponent with the theoretical prediction $\eta = (\gamma - 1)/(\delta - 1)$ (see text).
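
The predicted column of table I can be recomputed from the tabulated $\gamma$ and $\delta$ values. The short Python check below is only a consistency sketch; the dictionary `rows` and the function `predicted_eta` are names introduced here for illustration.

```python
# (gamma, delta, tabulated prediction) for each project, copied from table I.
rows = {
    "Python":    (1.97, 1.57, 1.70),
    "Gaim":      (1.97, 1.80, 1.21),
    "Slashcode": (1.88, 1.58, 1.51),
    "PCGEN":     (2.04, 1.67, 1.55),
    "TCL":       (1.97, 1.33, 2.93),
}


def predicted_eta(gamma, delta):
    """Theoretical exponent eta = (gamma - 1) / (delta - 1)."""
    return (gamma - 1.0) / (delta - 1.0)


for name, (gamma, delta, tabulated) in rows.items():
    value = predicted_eta(gamma, delta)
    print(f"{name:10s} computed {value:.3f}   table {tabulated:.2f}")
```

Each recomputed value agrees with the tabulated prediction to within 0.01.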
Fig. 3 - Correlations in the Python OS community. (A) Average degree of nearest neighbors vs degree,
$\langle k_{nn} \rangle \sim k^{-\theta}$, where $\theta \approx 0.75$ (open circles). The social network is disassortative from the structural
point of view. However, the weighted average nearest-neighbors degree (solid circles) captures more
precisely the level of affinity in the community (see text). Instead, traffic is redirected to the core
subset of highly connected nodes (backbone). (B) Average clustering vs degree, $\langle C(k) \rangle \sim k^{-\beta}$, with
$\beta \approx 1$.



However, a detailed analysis reveals a number of particular features intrinsic to these social 
networks. 

Correlations and Hierarchy in OS Networks. - A remarkable feature of software communities
is their hierarchical structure (see fig. 1), which introduces non-trivial correlations in the
social network topology. We can detect the presence of node-node correlations by measuring
the average nearest-neighbors degree $k_{nn}(k) = \sum_{k'} k' P(k|k')$, where $P(k|k')$ is the conditional
probability of having a link attached to nodes with degree $k$ and $k'$. In the absence of
correlations, $P(k|k')$ is constant and does not depend on $k$. Here, the average nearest-neighbors
degree decays as a power law of degree, $\langle k_{nn} \rangle \sim k^{-\theta}$ with $\theta \approx 0.75$ (see fig. 3A). This
decreasing behaviour of $k_{nn}$ denotes that low-connected nodes are linked to highly connected nodes
(see fig. 1) and thus these networks are disassortative from the topological point of view.
However, the same networks are assortative when we analyze edge weights. We have observed
that frequent e-mail exchanges take place between highly connected members. Following [11],
we define the weighted average nearest-neighbors degree,

$$k^w_{nn,i} = \frac{1}{s_i} \sum_{j=1}^{k_i} w_{ij} k_j \qquad (5)$$

where neighbor degree $k_j$ is weighted by the ratio $(w_{ij}/s_i)$. According to this definition,
$k^w_{nn,i} > k_{nn,i}$ if strong edges point to neighbors with large degree and $k^w_{nn,i} < k_{nn,i}$ otherwise.
In software communities, the weighted average nearest-neighbors degree is almost uncorrelated
with node degree, that is, $k^w_{nn,i} \sim$ constant (see fig. 3A). Low-connected nodes have weak
edges because $k^w_{nn}(k)$ is only slightly larger than $k_{nn}(k)$ for small $k$ (see fig. 3A). The social
network is assortative because strong edges attach to nodes with many links, i.e., the difference
$k^w_{nn}(k) - k_{nn}(k)$ is always positive and increases with degree $k$. The hierarchical nature of
these graphs is well illustrated by the scaling exhibited by the clustering $C(k)$ against $k$,
which scales as $C(k) \sim 1/k$ (see fig. 3B), consistent with theoretical predictions [14].
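
The difference between the plain and the weighted nearest-neighbors degree can be seen on a toy graph. The Python sketch below is a minimal illustration under the definitions given in the text; the helper names `build_weights` and `neighbor_degrees` and the four-node edge list are hypothetical.

```python
def build_weights(edges):
    """Return a symmetric dict-of-dicts w[i][j] from (i, j, weight) triples."""
    w = {}
    for i, j, weight in edges:
        w.setdefault(i, {})[j] = weight
        w.setdefault(j, {})[i] = weight
    return w


def neighbor_degrees(w):
    """Map each node to (k_nn, weighted k_nn).

    k_nn is the plain average of neighbor degrees; the weighted version
    averages neighbor degrees with weights w_ij / s_i.
    """
    out = {}
    for i, nbrs in w.items():
        k_i = len(nbrs)           # degree of node i
        s_i = sum(nbrs.values())  # strength of node i
        knn = sum(len(w[j]) for j in nbrs) / k_i
        knn_w = sum(w_ij * len(w[j]) for j, w_ij in nbrs.items()) / s_i
        out[i] = (knn, knn_w)
    return out


# Toy graph: node "a" is the hub; the a-b edge carries most of the traffic.
edges = [("a", "b", 5), ("a", "c", 1), ("a", "d", 1), ("b", "c", 1)]
result = neighbor_degrees(build_weights(edges))
for node, (knn, knn_w) in sorted(result.items()):
    print(node, round(knn, 3), round(knn_w, 3))
```

For node "b" the strong edge points to the hub, so its weighted value exceeds the unweighted one, which is the signature of weighted assortativity described above.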

Nonlocal Evolution of OS Networks. - A very simple model predicts the network dynamics
of software communities, including the shape of the undirected degree distribution
Fig. 4 - Social network simulation. (A) Linear correlation between node strength $s_i$ and betweenness
centrality (or node load) $b_i$ in the Python community. The correlation coefficient is 0.99. This trend
has been observed in all communities studied here. (B) Estimation of $\alpha$ in the TCL project (see
text). (C) Cumulative degree distribution in the simulated network (open circles) and in the real
community (closed squares). All parameters are estimated from real data: $N = 215$, $m_0 = 15$, $\langle m \rangle = 3$
and $\alpha = 0.75$. Notice the remarkable agreement between simulation and the real social network. (D)
Scaling of average neighbors degree vs degree in the simulated network (open circles) and in the real
social network (closed squares). Notice the remarkable overlap of simulation and real data for large
$k$. (E) Rendering of the simulated network to be compared with the social network displayed in fig.
1B.



$P(k)$ and local correlations (see fig. 4C, fig. 4D, and fig. 4E). The system starts (as in real OS
systems) from a fully connected network of $m_0$ members. At each time step, a new member
joins the community and a new node is added to the social network. The new member reports
a small number $m$ of e-mails (describing new software bugs). These new e-mails will
eventually be answered by expert community members. Member experience is estimated with
node strength $s_i$, or the total number of messages sent (and received) by member $i$ (eq.
3). In addition, any member takes into account all previous communications regarding any
particular software bug. This suggests that node strength is determined in a nonlocal manner
[15]. Indeed, we observe a linear correlation between strength $s_i$ and betweenness centrality
$b_i$ in software communities (see fig. 4A). The probability that individual $i$ replies to the new
member depends on the node load $b_i$,

$$\Pi[b_i(t)] = \frac{(b_i(t) + c)^{\alpha}}{\sum_j (b_j(t) + c)^{\alpha}} \qquad (6)$$

where $c$ is a constant (in our experiments, $c = 1$) and node load $b_i$ is recalculated before
attaching the new link, that is, before evaluating eq. 6. A similar model was presented
in [15], where $b_i$ is recalculated only after the addition of the new node and its $\langle m \rangle$ links.
Here, the recalculation of betweenness centralities represents a global process of information
diffusion. Once the target node $i$ is selected, we place a new edge linking node $i$ and the new
node.
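
The growth rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code behind fig. 4: the names `betweenness` and `grow` are hypothetical, betweenness is computed with Brandes' algorithm on the unweighted graph, attachment weights are taken as $(b_i + c)^{\alpha}$, the new node is already present (with its links so far) when loads are recalculated, and duplicate links are not allowed.

```python
import random
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to the set of its neighbors.
    """
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)  # number of shortest paths from s
        sigma[s] = 1.0
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for u in adj[v]:
                if dist[u] < 0:
                    dist[u] = dist[v] + 1
                    queue.append(u)
                if dist[u] == dist[v] + 1:
                    sigma[u] += sigma[v]
                    preds[u].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            u = stack.pop()
            for v in preds[u]:
                delta[v] += sigma[v] / sigma[u] * (1.0 + delta[u])
            if u != s:
                bc[u] += delta[u]
    # Each unordered pair of endpoints was counted twice.
    return {v: b / 2.0 for v, b in bc.items()}


def grow(n, m0, m, alpha, c=1.0, seed=0):
    """Grow a network of n nodes from a fully connected core of m0 nodes."""
    rng = random.Random(seed)
    adj = {i: set(range(m0)) - {i} for i in range(m0)}
    for new in range(m0, n):
        adj[new] = set()
        for _ in range(m):
            load = betweenness(adj)  # recalculated before every new link
            targets = [v for v in adj if v != new and v not in adj[new]]
            weights = [(load[v] + c) ** alpha for v in targets]
            target = rng.choices(targets, weights=weights, k=1)[0]
            adj[new].add(target)
            adj[target].add(new)
    return adj


network = grow(n=60, m0=5, m=3, alpha=0.75, seed=1)
print(max(len(nbrs) for nbrs in network.values()))  # largest degree
```

Recomputing all loads before each link is the costly part of the rule; for networks much larger than this toy size a faster betweenness routine would be needed.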
The networks generated with the previous model are remarkably similar to real OS networks.
For example, fig. 4 compares our model with the social network of the TCL software
community. The target social network has $N = 215$ members and $m = \langle k \rangle \approx 3$. A simple
modification to a known algorithm for measuring preferential attachment in evolving networks
[16] enables us to estimate the exponent $\alpha$ driving the attachment rate of new links
(described in eq. 6). Due to limitations in available network data we have computed the
attachment kernel depending on node strength $s_i$ instead of node load $b_i$. In order to measure
$\Pi[s_i(t)]$ we compare two network snapshots of the same software community at times $T_0$ and
$T_1$, where $T_0 < T_1$. Nodes in the $T_0$ and $T_1$ networks are called "$T_0$ nodes" and "$T_1$ nodes",
respectively. When a new node $i \in T_1$ joins the network, we compute the node strength $s_j$ of
the node $j \in T_0$ to which the new node $i$ links. Then, we estimate the attachment kernel as
follows

$$\Pi[s, T_0, T_1] = \frac{\sum_{i \in T_1,\, j \in T_0} m_{ij}\, \theta(s - s_j)}{\sum_{j \in T_0} \theta(s - s_j)} \qquad (7)$$

where $\theta(z) = 1$ if $z = 0$ and $\theta(z) = 0$ otherwise, and $m_{ij}$ is the adjacency matrix of the
social network. In order to reduce the impact of noise fluctuations, we have estimated the $\alpha$
exponent from the cumulative function

$$A(s) = \int_0^{s} \Pi(s)\,ds. \qquad (8)$$

Under the assumption of eq. 6, the above function scales with node strength, $A(s) \sim s^{\alpha+1}$.
Figure 4B displays the cumulative function $A(s)$ as measured in the TCL software community
with $T_0 = 2003$ and $T_1 = 2004$. In this dataset, the power-law fitting of $A(s)$ predicts an
exponent $\alpha = 0.75$. A similar exponent is observed in other systems (not shown). In addition,
we have estimated the $\alpha_{BA}$ exponent with a preferential attachment kernel, $\Pi(k) \sim k^{\alpha_{BA}}$, as
in the original algorithm by Jeong et al. [16]. The evolution of the social networks cannot
be described by a linear preferential attachment mechanism because the observed exponent is
$\alpha_{BA} > 1.4$ (not shown).
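
The estimation procedure can be sketched as follows. This Python example is illustrative only: the function names, the three-node snapshot, and the use of a discrete cumulative sum in place of the integral in eq. 8 are assumptions made here, and the least-squares fit in log-log space stands in for whatever fitting method was actually used.

```python
import math
from collections import Counter


def attachment_kernel(strength_t0, new_links):
    """Kernel of eq. 7: new links per T0 node, grouped by strength.

    strength_t0 maps each T0 node j to its strength s_j.
    new_links lists (i, j) pairs with i a T1 node and j a T0 node.
    """
    nodes_with_s = Counter(strength_t0.values())
    links_to_s = Counter(strength_t0[j] for _, j in new_links)
    return {s: links_to_s[s] / nodes_with_s[s] for s in sorted(nodes_with_s)}


def cumulative_kernel(kernel):
    """Discrete stand-in for A(s): running sum of the kernel over s."""
    total, out = 0.0, {}
    for s in sorted(kernel):
        total += kernel[s]
        out[s] = total
    return out


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); needs positive data."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx


# Hypothetical snapshot: three T0 nodes and three links from new T1 nodes.
strength_t0 = {"a": 1, "b": 1, "c": 2}
new_links = [("x", "a"), ("y", "c"), ("z", "c")]

kernel = attachment_kernel(strength_t0, new_links)
cumulative = cumulative_kernel(kernel)
print(kernel, cumulative)

# The fitted slope of A(s) estimates alpha + 1.
slope = loglog_slope(list(cumulative), list(cumulative.values()))
print("alpha estimate:", slope - 1.0)
```

With only a handful of points the fitted exponent is not meaningful; the example is meant to show the bookkeeping, and a real estimate needs the full pair of snapshots.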

Discussion. - The analysis of correlations in open source communities indicates they
are closer to the Internet and communication networks than to other social networks (e.g.,
the network of scientific collaborations). The social networks analyzed here are disassortative
from the topological point of view and assortative when edge weights are taken into account. A
distinguishing feature of social networks in software communities is a subset of core members
acting as the community backbone. In these communities, the bulk of e-mail traffic is
redirected to the strongest members, which are reinforced as the dominant ones.

We have presented a model that predicts many global and local social network measurements
of software communities. Interestingly, the model suggests that reinforcement is nonlocal,
that is, e-mails are not independent of previous e-mails. The conclusions of the present
work must be contrasted with the local reinforcement mechanism proposed by Caldarelli et
al. [8]. In their model, any pair of members can increase the strength of their link
independently of the global activity. Several features of software communities preclude the
application of their model. For example, fixing a software bug is a global task which requires the
coordination of several members in the community. Any e-mail response requires considering
all the previous communications regarding the specific subject under discussion. In addition,
their model does not consider a sparse network structure and every individual is connected
with everybody else, which is not the case in OS communities.

We can conceive other alternatives instead of computing betweenness centralities in eq. 6.
An interesting approach includes the discrete simulation of e-mails tracing shortest paths in the
social network, as in some models of Internet routing [17]. Packet transport-driven simulations
can provide good estimations of the number of e-mails received by any node. Nevertheless,
the present model enables us to explain the OS network dynamics remarkably well. Another
extension of the model is the addition of new links between existing nodes, which can provide
better fittings to local correlation measures. Finally, the current model is a first step towards a
theory of collaboration and self-organization in open source communities. In this context, the
techniques and models presented here are useful tools to understand how social collaboration
takes place in distributed environments.